Using a Proxy to Enable Communication Overlap with Computation

Authors

  • Stephen Lau
  • William G. Griswold
Abstract

Thesis: Using a Proxy to Enable Communication Overlap with Computation, by Stephen Lau. Master of Science in Computer Science & Engineering, University of California, San Diego, 2003. Professor Scott B. Baden, Chair.

Parallel computation with distributed memory has two key phases: local computation on the nodes of the system, and communication between these nodes. While communication speeds have increased with technological improvements, their growth cannot match the rapid increase in processor speed, leading to a rising imbalance in the cost of communication relative to computation. Overlapping communication with computation is therefore desirable, as it amortizes communication latency by hiding it during periods of computation. Typically, nodes in a parallel machine communicate via message passing libraries such as MPI. However, MPI is not always able to realize overlap, even with the asynchronous routines it provides. The MPI standard does not mandate when a message has to be delivered, and transmission can often be delayed arbitrarily. Also, larger messages may exceed the system's preset synchronous message limits, delaying the return of control to the sender and increasing the perceived overhead of the message passing system. We present a hybrid model of computation combining the two most common paradigms of parallel communication, message passing and shared memory, to enable a Proxy thread running on a dedicated CPU of a clustered SMP system to maximize overlap of communication with computation. We gathered results from two SMP clusters: LeMieux, with 4-way nodes, and Blue Horizon, with 8-way nodes.

Chapter I

Introduction

There have been many trends in the design of parallel computers. Historically, parallel computers were highly specialised, expensive, unique supercomputers. This proved too expensive to build, leading to designs that use cheaper components. Multicomputers, composed of networks of single-processor computers, occupy one end of the spectrum (Figure I.1). Another design used multi-processor computers: single machines with multiple CPUs. The next obvious trend was to combine the two single-tier hierarchies and use clusters of multi-processor computers to form a multi-tier hierarchy (Figure I.2) [12] [11].

Figure I.1: Single-tier Computer System

Figure I.2: Multi-tier Computer System

The relative cost of computational power has fallen with Moore's Law, making clusters of workstations or multi-processor computers more feasible and cost-effective. Commodity off-the-shelf computers and workstations, owing to high sales volumes and more rapid technological innovation, provide superior cost-performance ratios compared to more traditional parallel machines or microcomputers. However, advancement in communications technology has not kept up with the growth of computation, leading to a disparity between computation and communication [14]. The growth of computational power is quickly out-pacing the growth of communication, making communication the dominant cost within parallel algorithms. Hardware solutions have attempted to remedy this by off-loading some of the computational load of communication protocols to on-board processors on the network interfaces themselves, but in addition to being expensive, they often cannot off-load basic software protocol overheads.
With the decreasing cost of symmetric multi-processing (SMP) machines, clusters with high computational power are becoming more available at lower cost, while communication remains at a premium. This is a significant concern in clusters of multi-processor computers, as they are typically limited to one network input/output device per node. In multicomputers, where each node has only one processor, the network I/O device is more capable of handling the communication load generated by the single processor. Multi-processors, which have no external network, do not have a communication load problem, as each processor is connected to every other processor through the machine's internal bus. However, with clusters of SMPs, the network I/O device must be shared between all the processors within the node. Each node can have 4, 8, 16, or even more processors. As more processors are added, the cost of communication rises.

Multi-tier architectures attempt to save on communication costs by aggregating processors onto a node. For example, given a system with 512 processors total, a 128-node arrangement where each node has 4 processors might be slightly faster, in that its available bandwidth per processor is higher than that of a 64-node 8-way arrangement. However, as more nodes are added to the interconnect, the cost of the switch grows, sometimes prohibitively so. Multi-tiered computers must balance the cost of the interconnect against the communication capability available to each node on the interconnect. While narrowing the width of the node by decreasing the number of processors per node will increase communication performance, the increase in cost runs counter to the whole point of using commodity-component clusters.

I.A The need for overlap

The most common means of implementing communication between nodes of a multicomputer system is message passing. Message passing libraries such as MPI [22] (Message Passing Interface, the industry standard for message passing) often do not provide an easy way to eliminate communication software overheads. Their basic message passing routines are synchronous, resulting in blocks of computation with global synchronisation points where all clients sync up and communicate. This model of parallel computation is called Bulk Synchronous Parallelism (BSP) [4]; such problems were also called Loosely Synchronous by G. C. Fox [13]. Smaller messages can often be sent immediately without requiring a receive call to be posted first, but once they exceed the Eager Limit set by the system (see footnote 1 below), the receive call must be posted on the receiving side before the send call will return control back to the caller, referred to as rendezvous mode communication (see footnote 2 below). MPI gets around this by offering non-blocking asynchronous routines in an attempt to provide overlap of communication with computation: the user sets up or "posts" communication calls that return quickly, allowing the program to continue into a period of computation while the system asynchronously handles the communication in the background. However, even these are not foolproof. The MPI standard does not mandate, or even set a ceiling on, when a message must be delivered; it can be held and delayed for an arbitrary amount of time before delivery [22].
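As a concrete illustration of the immediate-mode approach just described, the sketch below posts a ghost-cell exchange with MPI_Irecv/MPI_Isend, performs ghost-independent work, and only then waits on the requests. It is a minimal, hypothetical example rather than code from the thesis; the helper routines, buffer names, and neighbour ranks are placeholders.

    #include <mpi.h>
    #include <vector>

    // Hypothetical placeholders for the application's computation; neither
    // name is taken from the thesis code.
    void compute_inner_region(std::vector<double>&) { /* ghost-independent work */ }
    void compute_annulus_region(std::vector<double>&, const std::vector<double>&) { /* ghost-dependent work */ }

    // Post a ghost-cell exchange with immediate-mode calls, compute the
    // interior while the messages are (ideally) in flight, then wait
    // before touching the ghost-dependent annulus.
    void exchange_and_overlap(std::vector<double>& u,
                              std::vector<double>& send_halo,
                              std::vector<double>& recv_halo,
                              int up, int down)            // neighbour ranks
    {
        MPI_Request reqs[2];
        MPI_Irecv(recv_halo.data(), static_cast<int>(recv_halo.size()), MPI_DOUBLE,
                  up, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_halo.data(), static_cast<int>(send_halo.size()), MPI_DOUBLE,
                  down, 0, MPI_COMM_WORLD, &reqs[1]);

        compute_inner_region(u);                            // the overlap window

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);          // completion point
        compute_annulus_region(u, recv_halo);
    }

Whether any real overlap occurs still depends on the MPI library making progress on the posted requests while compute_inner_region runs, which, as discussed above, the standard does not guarantee.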
Overlap strategies involving these non-blocking communication calls necessitate network interface polling [20] [18], which compromises performance by forcing the system to periodically context-switch to handle the system-level polling calls. At a more basic level, it is also possible that the message layer library (MPI) is not optimised correctly for the hardware platform, and may not be able to take advantage of any intelligent asynchronous communication capabilities the hardware or network interface may have.

The cost of message passing protocols such as MPI can be broken down into two basic components: overhead and transmission time. The overhead is the cost of the message startup: essentially, all the time involved aside from the actual message transmission. This includes components such as data serialisation, protocol stack startup, memory buffer packing, etc. The transmission time is the actual network transfer cost of sending the complete message. The overhead is mainly limited by on-node resources, such as the computational ability to quickly pack data and get it to the network interface; it is not limited by bandwidth or the interconnect itself, unlike the transmission time, which is governed almost solely by the bandwidth of the interconnect. In asynchronous calls such as those provided by the immediate mode MPI calls, only the transmission time can be overlapped; the overhead cost of calling the basic communication primitives cannot simply be avoided. Overlap using asynchronous calls does, however, achieve the goal of overlapping the perceived communication time with computational periods, amortizing the cost of the network transfer across the computation time. This is all, of course, dependent on the MPI library's ability to realise overlap using asynchronous communication to begin with. In our Results chapter we provide data showing one of our machines' MPI library's inability to attain overlap using MPI's immediate mode asynchronous communication primitives.

Footnote 1: The system's defined Eager Limit is the largest point-to-point message which will be accepted by a synchronous communication send if no matching receive call has been posted prior, allowing the synchronous send to return control back to the caller immediately, similar to an asynchronous send.

Footnote 2: http://www.cs.unb.ca/acrl/training/general/programming_ibm_power3/sld048.htm and http://www.lanl.gov/orgs/cic/cic8/para-dist-team/MSGPASS/MPI/.../IMPLEMENTATION_ISSUES/CHARACTERISTICS/unexpected.html

Different classes of problems lend themselves to overlap differently. Independent, or embarrassingly parallel, computation classes can be rewritten quite easily to utilise overlap, as iterations or datasets are relatively independent of each other. However, problems where iterations and/or data are dependently intertwined, or problems requiring frequent periodic synchronisation points, are often harder to overlap.

I.B Enabling Overlap via a Proxy

Proxies are useful as an intermediary communications manager. They allow nodes to focus on being computational clients, and they can help amortize the cost of communication on multi-processors [17]. In addition, because clusters of multi-processors are composed of SMP nodes, proxies give an opportunity to achieve overlap on machines that may not support communication overlap via conventional asynchronous calls. Consider a multicomputer with four-way SMP nodes: one can set aside one CPU per node as the proxy for the remaining three CPUs.
Rather than having each CPU incur its own network I/O costs for each send/receive of data, the proxy (typically the first of the SMP CPUs, though for no necessary reason) incurs all the costs. In a system of single-processor nodes, conventional overlap techniques (such as asynchronous message passing calls) allow the transmission time to be hidden, as noted above, but the costs associated with the overhead of the messaging system cannot be hidden. The usefulness and practicality of a proxy becomes more evident on clusters where the nodes are symmetric multi-processing (SMP) machines. Using one of the CPUs on an SMP node to manage all the communication allows these overhead costs to be incurred by the proxy, leaving the other CPUs free to continue processing, thus eliminating a serial portion of the communication code. The downside is, of course, the loss of one of the CPUs from the total computational power of the node if a dedicated CPU is used for the proxy. If a CPU is shared between a proxy and a compute client, thrashing of that CPU and its cache may occur, leading to disproportionate run-times for the compute client forced to share its CPU with the proxy. If running a synchronous algorithm, this may in turn slow down the other CPUs forced to synchronise with the shared compute client.

While we are interested chiefly in proxies as a means of obtaining better communication overlap, they can also be used as a means of giving protected access to a single network I/O interface, as mentioned in the paper by Lim, Heidelberger, Pattnaik, and Snir [17]. This is often an important consideration, as many libraries such as MPI are not guaranteed to be thread-safe. MPICH (a free implementation of MPI popular on Linux and BSD clusters; see footnote 3 below), for example, is not thread safe. Use of a proxy allows all MPI communication routines to be called from the proxy thread, maintaining protected access to the network interface. IBM's implementation of MPI [2], for instance, is not guaranteed to be thread-safe. This protected access also leads to higher-throughput communication, as the single proxy thread saves the network I/O from being shared and switched between multiple other threads.

Footnote 3: MPICH, A Portable MPI Implementation, can be found at http://www-unix.mcs.anl.gov/mpi/mpich/

I.C Plan and Summary of Results

The work in this thesis extends work done by Fink [11] and Baden [3]. We attempt to show that for certain classes of problems, using a communication proxy proves advantageous over asynchronous immediate mode MPI calls in enabling overlap of communication with computation. For the two applications we implement, a partial differential equation solver (RedBlack3D) and matrix-matrix multiplication (SUMMA), we observed mixed results. We observed that getting the best performance out of a problem retooled to use a proxy required balancing the size of the problem against the node configuration used to solve it. Smaller problem sizes tend not to provide enough communication, and larger problem sizes tend to provide too much computation, making a proxy beneficial for some classes of problems, but not in the general case.

Chapter II

Motivating Applications

II.A Finite Difference Method

The first application is called RedBlack3D, a solver which solves a partial differential equation (Poisson's equation) in three dimensions. It uses a successive over-relaxation (Gauss-Seidel) method to solve the discrete equation using a seven-point nearest-neighbour (in all three dimensions) stencil.
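To make the relaxation step concrete, the following is a minimal sketch of one red/black Gauss-Seidel sweep with the seven-point stencil, under the usual discretisation of Poisson's equation on a cubic grid with spacing h. It is not the thesis's serialRelax routine; the flat indexing and the array names u and rhs are assumptions for illustration.

    #include <cstddef>
    #include <vector>

    // One relaxation sweep over the points of a single colour (0 = "red",
    // 1 = "black") using the seven-point nearest-neighbour stencil for
    // Poisson's equation. The grid is (n+2)^3 including a one-cell
    // ghost/boundary layer; h is the mesh spacing.
    void relax_colour(std::vector<double>& u, const std::vector<double>& rhs,
                      int n, double h, int colour)
    {
        auto idx = [n](int i, int j, int k) {
            return static_cast<std::size_t>((i * (n + 2) + j) * (n + 2) + k);
        };
        for (int i = 1; i <= n; ++i)
            for (int j = 1; j <= n; ++j)
                for (int k = 1; k <= n; ++k)
                    if ((i + j + k) % 2 == colour)          // update only this colour
                        u[idx(i, j, k)] =
                            (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                             u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                             u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)] -
                             h * h * rhs[idx(i, j, k)]) / 6.0;
    }

A full iteration typically sweeps one colour, exchanges ghost cells, and then sweeps the other colour; the single-tier and multi-tier variants described below differ only in how this kernel is partitioned and when the ghost-cell exchange is posted.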
The basic strategy of RedBlack3D, not taking into account the multi-tier hierarchy of the computer, is to divide the grid into a flat uniform distribution of chunks of data spread evenly across all nodes. Figure II.1 shows an example distribution of data across four nodes. The RedBlack3D algorithm is divided into two phases: computation and communication.

Figure II.1: The single-tier synchronous solution, partitioned in a flat uniform distribution across 4 nodes, showing the halo region of ghost cells for each node

Figure II.2: The single-tier asynchronous solution, partitioned in the same flat uniform distribution across 4 nodes, showing the same halo region of ghost cells. The shaded region denotes the outer annulus, which is dependent upon the completion of the asynchronous communication.

Figure II.3: The multi-tier solution, partitioned at the node level on a 4-way SMP, showing the halo as the shaded region. The annulus is depicted in the outer regions labelled 4-7. The interior is labelled as regions 0-3, one for each processor on the node.

    ST_Relax() {
        initiate ghost cell communication
        wait for ghost cell communication completion
        serialRelax(U[proc_entire])
    }

Figure II.4: Pseudocode for the RedBlack3D single-tier relaxation kernel

    MT_Relax() {
        initiate ghost cell communication
        serialRelax(U[proc_inner], rhs[proc_inner])
        wait for ghost cell communication completion
        serialRelax(U[proc_annulus])
    }

Figure II.5: Pseudocode for the RedBlack3D multi-tier relaxation kernel

    MT_Relax() {
        initiate ghost cell communication
        parallel for (each processor in node)
            serialRelax(U[node_inner], rhs[node_inner])
        wait for ghost cell communication completion
        parallel for (each processor in node)
            serialRelax(U[node_annulus], rhs[node_annulus])
    }

Figure II.6: Pseudocode for the RedBlack3D proxy relaxation kernel

In the synchronous, non-overlapped single-tier version, the sub-problems are defined solely as one region per node, with a ghost cell region bordering the sub-problem. This ghost cell region is copied during the communication phase from the process that owns the neighbouring sub-problem. The computation phase then computes and updates the local sub-problem, and sends out the border region to its neighbours so they can update their own ghost regions.

In the asynchronous, overlapped single-tier version, the sub-problem is further divided into the region that depends on the ghost cells (called the annulus) and the region that is independent of any ghost cell updates (called the inner region). The two phases are then further sub-divided: a first communication phase starts asynchronous communication to receive the ghost cells from the neighbours, and local computation is then done on the inner region. After the inner computation is completed, the process waits to ensure the asynchronous communication has completed; upon completion, the computation updates the annulus region. After the annulus region has been computed, it is sent to the neighbours so they can update their ghost cell values.

In the asynchronous, overlapped multi-tier proxy version, the sub-problems are defined at the node level rather than the processor level. The annulus and inner regions are defined the same way as in the asynchronous single-tier version, except that the annulus region now wraps around the entire node rather than each processor. The phases are the same in that a communication phase is first run, initiating asynchronous communication on the ghost cells from the node's neighbours.
The local computation divides the inner region into p smaller disjoint sub-blocks (not necessarily uniform), which are mapped to the processors within the node. The processors compute and update their own sub-blocks. Upon completion, the algorithm waits to ensure ghost cell updates have completed before dispatching the processors to compute and update the annulus region and communicate the updated values to the node's neighbours. Figure II.3 shows the partition and distribution of data at the node level, depicting the annulus and the partitioning of the inner region.

In the MPI native non-blocking asynchronous model, communication is flattened out into the single-tier model, which ignores node-level communication. All communication is done processor to processor. An MPI geometry is defined which blocks all the processors on a node into a grid, thus minimizing off-node communication. The proxy model, however, maps more closely onto the multi-tier model, eliminating processor-level communication: processors do not communicate with other processors off node. All off-node communication is handled at the node level by the proxy thread.

The non-overlapped model will of course tend to have the highest communication cost, as the processes wait for communication to complete before continuing on to the computational update phase. The overlapped single-tier model attempts to overlap the communication with the local computation. Communication in this single-tier overlapped model occurs at two speeds, as communication between two neighbouring processes residing within the same node should be fast (it does not have to go out to the interconnect), while two processes residing on different nodes will be slowed by the interconnect. Communication in the multi-tier proxy model should be better still, as it eliminates excessive annulus region partitioning because the annulus is defined at the node level rather than the processor level.

II.B Matrix-matrix Multiplication

The second application implemented was Van de Geijn's SUMMA (Scalable Universal Matrix Multiplication Algorithm) [15] [1]. SUMMA is a fast, scalable matrix-matrix multiplication algorithm. More importantly, the SUMMA algorithm is very easy to rewrite to enable overlap. Essentially, SUMMA uses broadcasting of data within row and column groups to compute the product more efficiently, as shown in Figures II.7-II.9. The local sub-problems are computed using the BLAS level 3 dgemm matrix-multiply kernel, which is generally provided as part of the system libraries.

Figure II.7: Phase 1. All processors in the same column group as the root node broadcast their sub-matrices of A to all the nodes in their same row group (i.e. laterally along the mesh)

A detailed description of the algorithm can be found in [15]. A brief overview is necessary here, however, to explain how overlap is achieved. The processors are partitioned into an r x c mesh of nodes, with the two-dimensional matrices overlaid on top. In Figures II.7-II.9, a 4x4 grid of processors is overlaid on the matrix to partition it into 16 equally sized sub-matrices. Each sub-matrix is of dimension i rows by j columns. For our purposes, since we are operating on a uniform square matrix, i = j. The matrix-matrix multiplication is then executed as a sequence of rank-1 updates, where each node loops over the common dimension k (given that A = m x k and B = k x n). The iteration runs in panels, processing several values of k together to benefit from locality and stride.
Within the loop (iterating with i from 0 to k as mentioned), the algorithm works in three phases. During phase 1 (Figure II.7), all processors in the ith column broadcast a copy of their A sub-matrix to all the other processors in the same row group (i.e. a lateral broadcast along the rows of the mesh). The receiving processors receive this sub-matrix into a local buffer R. In phase 2 (Figure II.8), all processors in the ith row broadcast a copy of their B sub-matrix to all the other processors in the same column group (i.e. a longitudinal broadcast along the columns of the mesh). The receiving processors receive this sub-matrix into a local buffer S. In the third and final phase (Figure II.9), each processor does its local dgemm operation. The root node (the node which owns both the ith row and column) does a dgemm of the two local sub-matrices of A and B. The nodes that broadcast laterally do a dgemm of their sub-matrix of A with the received sub-matrix of B in S (from phase 2). The nodes that broadcast longitudinally do a dgemm of their sub-matrix of B with the received sub-matrix of A in R (from phase 1). The remaining nodes in the mesh do a local dgemm of the received sub-matrix of A in R (from phase 1) with the received sub-matrix of B in S (from phase 2).

Figure II.8: Phase 2. All processors in the same row group as the root node broadcast their sub-matrices of B to all the nodes in their same column group (i.e. longitudinally along the mesh)

Figure II.9: Phase 3. Local dgemm phase where the root node (hatched) multiplies the two sub-matrices of A and B it owns together. The neighbours (shaded) in its row group matrix-multiply their sub-matrix of B with the received sub-matrix of A from Phase 1. The neighbours (shaded) in its column group matrix-multiply their sub-matrix of A with the received sub-matrix of B from Phase 2. The other nodes matrix-multiply both the received A and B sub-matrices.

    Cmn = 0
    for i = 0 to (k - 1) {
        broadcast Ami to members of row group
        broadcast Bin to members of column group
        Cmn += dgemm(Ami x Bin)
    }

Figure II.10: Pseudocode for the SUMMA operation C = AB

For the synchronous model, the algorithm follows a basic pattern where, for each iteration, the row broadcast is done first, followed by the column broadcast, both using synchronous communication, as the data being received is necessary for the dgemm sub-matrix multiplication. The dgemm is then run, and the next iteration is started.

Enabling overlap is relatively straightforward in the SUMMA algorithm due to the clean separation and relative independence of communication within each loop iteration. The very first iteration is dependent upon the data already being there, so the R and S buffers are filled synchronously for this first iteration. Once that is complete, for each iteration i, asynchronous communication is started for iteration i+1 before starting the dgemm operation on the iteration-i sub-matrix multiplication. When the dgemm is complete, the node waits on the asynchronous communication for i+1 to ensure that all the data necessary for the i+1 iteration has been received. Once the data transfer is complete, the next iteration i+1 is started, which in turn asynchronously starts the communication for iteration i+2, and so on. This makes the algorithm relatively easy to overlap.

Chapter III

Implementation

III.A Overview

We propose using a "Proxy" thread running as a dedicated thread on a dedicated CPU of an SMP node. This proxy then manages communication for that entire node.
We hope that by using a hybrid communication model incorporating both shared memory threads and message passing we can enable better performance and achieve easier overlap.

III.B What does the proxy provide?

III.B.1 Network device sharing and protection

As mentioned in the introduction, clusters of multi-processor nodes typically have only one network interface device per node. Uni-processor machines have exclusive access to the network device, allowing for unlimited use. However, this is wasteful in that the network device is not always in use; there are inevitably periods of non-use within many parallel algorithms. In an SMP node, we can fill in the periods of non-use by enabling the other processors to share the network device, making expensive communication interfaces more cost-effective by raising their utilisation. The extra threads create more communication, which can then be multiplexed by the proxy thread onto the network device.

However, if all processors share the network interface, contention for the single hardware device rises. While some implementations of MPI are thread-safe, it is not guaranteed that all are. Also, even if they are thread-safe, there is a context-switching overhead incurred when the scheduler must manage permission control of the network interface between multiple processors or threads that request access to it; the system must manage access to the network device with some sort of mutual exclusion. With the proxy, however, all communication is managed by the single proxy thread. This frees the other threads from having to switch access to the network interface, allowing one thread to retain exclusive access and control of the interface. This eliminates any protection or locking issues, as well as the context-switch overhead mentioned above.

III.B.2 A dedicated communications manager

A proxy achieves better performance by allowing overlap, as mentioned. It allows a separate proxy thread to handle communication so that computational clients running in other threads do not waste cycles on communication routines, and can instead devote their processing power to computational cycles. Regular asynchronous routines provide ways of hiding the latency of communication; however, protocol overheads cannot be amortized away with these asynchronous routines. A proxy allows for reduced latency as well as elimination of the protocol overheads, resulting in increased effective bandwidth. Falsafi and Wood show [10] that a proxy running on a dedicated CPU can benefit light-weight communication protocols in applications that exhibit high amounts of communication, i.e. where communication is a bottleneck.

By removing the communication portions of the algorithm and allowing other threads to focus solely on computation, the dedicated proxy thread helps eliminate the context overheads incurred in switching between computation and communication management. Because all the management is done by a dedicated thread on its exclusive processor, the other threads running on the remaining processors achieve better cache locality by staying within their computational phases without having to worry about communication management. An analogy can be made to a group of n engineers. While having all n engineers do development work is one solution, in practice it is almost always better to have one manager for the team of engineers, taking care of busy work such as budgeting, scheduling, etc.
Taking care of the busywork frees the other team members to focus on the engineering development necessary to complete the project. If all n engineers have to deal with their own paperwork, balancing budgets or scheduling meetings, they are less effective than they would be with one manager to manage them, even with the loss of the manager as a potential engineer.

III.C Models of communication

There are four permutations of communication models into which an algorithm can be classified. The issue of single-tier vs. multi-tier concerns how closely the algorithm is aware of the architecture of the system. In single-tier formulations, the algorithm sees the system merely as a one-level forest of nodes, flattening out any sense of a hierarchy (if one exists). Multi-tier formulations are more aware of the system architecture, and thus can take advantage of system enhancements and locality that the single-tier formulation may not. The choice of making an algorithm single-tier or multi-tier aware is a balance between portability and performance. Single-tier algorithms are more portable, since any architecture can be flattened to a single-tier representation. While some performance may be lost in discarding system-awareness, it can often be compensated for by hardware or system-level enhancements (such as making the underlying communication library multi-tier aware, while still retaining a single-tiered API). However, multi-tier algorithms, which can map themselves onto the system architecture, have a better idea of how to partition data, and can optimize the distribution of data in order to minimize communication. Since this is done at the algorithm level, system-level enhancements done at the library level cannot help as much.

Synchronous versus asynchronous communication concerns how the communication between nodes is negotiated. In a synchronous communication model, the receiver must be aware of and ready to receive a communication before the sender can complete and return control back to the caller (as mentioned in the introduction). In an asynchronous communication model, the sender can initiate a send before the receiver is even aware that it will be receiving data. Control returns to the sender, who can go ahead and complete other tasks. When it wants to know whether the communication has completed, it waits on the communication, making that step synchronous in that the wait will not return until the communication has completed.

Overlap is achieved by using asynchronous communication in either a single-tier or a multi-tier model. Overlap is also achievable in the multi-tier model by using a dedicated proxy thread with either synchronous or asynchronous communication. Our data in this thesis compares three models: single-tier synchronous, single-tier asynchronous, and multi-tier using a dedicated thread.

III.C.1 Message passing

Message passing toolkits such as MPI provide an explicit, clear-cut means of doing both inter- and intra-node communication. On the MPI implementations running on LeMieux and Blue Horizon, each process is spawned as its own heavy-weight process, rather than as a thread. This models the node environment as a flat single-tier system, as it abstracts away the SMP layer, treating every process (and thus CPU) as a member of a flat system of nodes. On an improperly designed implementation of a message passing layer, it is possible that this may result in excessive network usage due to the hiding of the SMP hierarchy.
On both LeMieux and Blue Horizon, the MPI runtime is designed to short-circuit the network for intra-node messages, i.e. any message going from one MPI process to another MPI process residing on the same node bypasses the network interface and stays within the node. However, this still results in a deep memory copy from the address space of the first MPI process to the second. On Blue Horizon, the point-to-point transmission time between two processors on-node is 12 µs, versus 140 µs for an off-node message sent using MPI synchronous primitives. On LeMieux, the point-to-point transmission time between two on-node processors is 12 µs, with an essentially identical time of 13 µs for off-node messages. These times are for point-to-point messages consisting of one double. For longer messages of 4096 doubles, LeMieux's off-node messages take approximately twice as long as its on-node messages (79 µs on-node versus 154 µs off-node).

Message passing toolkits' main advantage lies in their more explicit control flow compared to shared memory toolkits, leading to easier development and subsequent debugging. While shared memory systems are convenient for intra-node communication, distributed shared memory (DSM) systems abstract away the distribution of memory across remote nodes, typically resulting in worse performance. Higher performance is possible if "hints" are provided, or if the shared data is laid out in a specific manner so as to minimise coherence transactions. However, to achieve the best performance, it is often necessary to provide as much control information as one would need to program in a message-passing environment [18].

III.C.2 Shared memory

Other parallel processing toolkits such as OpenMP or POSIX threads (pthreads) support a shared memory method of sharing information between processes. Shared memory provides an easier interface than the message-passing paradigm, as it is simply an extension of the serial memory programming interface with which the programmer is already familiar. However, shared memory implementations that allow off-node access are often complex, and have longer latencies or higher overheads than the simpler message passing toolkits; hence the popularity of MPI over threads. Shared memory libraries are often faster for intra-node communication, however, as shown by the thread packages' behaviour on both LeMieux and Blue Horizon. Threads are spawned as light-weight processes, which enables them to have shared pages of memory mapped between threads. As mentioned above, however, using distributed shared memory for remote node memory access often results in poor performance. While the semantics make for more convenient usage, the costs are hidden by the ease of use: abstracting away the "remoteness" of data merely allows the programmer to conveniently ignore where that data resides, resulting in mediocre or sub-par performance compared to message passing environments, which force the programmer to address where the data actually resides.

Typically, parallel algorithms are implemented using either message passing (on multicomputers, or clusters composed of a network of single-processor nodes) or shared memory (on multi-processors, i.e. a single computer composed of multiple processors). The shared memory paradigm does not map easily to a distributed network of multi-processors.
The overhead of attempting to maintain shared memory semantics overlaid on top of (typically) a message passing communication layer between nodes is significant, despite the benefit of maintaining an easy-to-use programming paradigm and syntax [18]. However, among threads running within the same node on a multi-processor machine, such as the SMP nodes utilized in this work, shared memory often provides better performance and lower overheads than message passing toolkits. Sharing pages of memory between light-weight threads, and allowing the threads to maintain their own protection model, eliminates many overheads compared with heavier-weight message passing processes. Because the threads share the same address space, context-switching overheads between threads are significantly reduced compared to processes as well.

Threads can often provide benefits when parallel code written solely with MPI suffers from load balancing or memory limitations [21] [8]. Load balancing can be hard to set up and manage in a system of processes communicating with messages. For example, a master-worker job queue can be hard to manage with message passing alone: workers may take too large a job slice, leading to load imbalance, while taking too little work leads to frequent and excessive communication. Shared memory allows for finer-grain load balancing, as workers can take work as they need it from a shared data area without explicit messages. Memory limitations can arise when spawning multiple heavy-weight processes on one node, as each process must take its own stack and incur its own overhead. Light-weight threads working with a shared data area often enjoy lower overheads, leading to lower memory requirements.

III.C.3 POSIX threads vs. OpenMP threads

In our work, we compared using POSIX threads (pthreads) versus OpenMP threads. The main differences lie in the scheduling and placement of the respective threads on dedicated or shared CPUs. Placement of both OpenMP and POSIX threads is controlled by the operating system, with a job queue associated with each of the CPUs within a node; management of the queues is controlled at the operating system level. Working with the Pittsburgh Supercomputing Centre support staff, it was determined that placement over time in response to system load varied between pthreads and OpenMP threads. OpenMP allocates a fixed number of threads, as determined by the OMP_NUM_THREADS environment variable, upon initialisation of the omp parallel pragma directive. Typically, if the number of threads matches the number of CPUs on a node (in the case of LeMieux, this is four), then the threads are usually scheduled onto separate and dedicated CPUs, regardless of the potential workload. The POSIX threads package, however, will not move a newly instantiated thread to its own processor unless the current processor has a high enough workload to mandate migrating a thread to a new CPU. If the total aggregate workload is not enough to warrant another processor, then the pthreads package will run the multiple threads on a smaller subset of physical processors, forcing two or more threads to share a processor. The idea behind this is to improve cache behaviour for multiple threads that may be sharing memory. While the initial behaviour of the two packages differs (i.e. the initial placement of a newly created thread), the prolonged behaviour of both thread packages results in similar scheduling policies.
Both the OpenMP and pthreads packages tend to bind threads to a processor, where they remain in place as long as the workload is consistent. The only exception is when the workload is trivial, in which case OpenMP tends to keep the threads running on separate CPUs, whereas POSIX threads will coalesce onto a smaller subset of CPUs.

In addition, there is the issue of portability. OpenMP is a thread library that HPC (high-performance computing) vendors have standardized on, with open support from the likes of HP/Compaq, IBM, etc. There are many thread libraries available, such as pthreads, but OpenMP is the supported and accepted standard. Even where it is available, the pthreads package does not always exhibit standard behaviour: the POSIX standard provides a standardized interface, but does not dictate portable behaviour such as scheduling policies. The choice was made to use the OpenMP threads package, as it provided more explicit dedicated-CPU scheduling during the initialisation phase. Since we know that our computational phases are designed to maximize CPU utilization, we know that our threads will need dedicated processors. The POSIX threads package will migrate the threads to their own dedicated processors eventually; OpenMP, however, saves us the cost of migrating a thread from one processor to another.

III.C.4 Utilizing a hybrid model

While the MPI routines on both LeMieux and Blue Horizon use shared memory to bypass and short-circuit going out to the network, we believe that a hybrid approach of both message passing (using MPI) and shared memory (using OpenMP threads) will provide better performance. The message passing paradigm maps well to heavy-weight processes, and one can indeed run multiple processes on an SMP machine, mapping one process per processor. However, using a message passing library to go out onto the network to communicate from one processor to another processor residing on the same node is extremely wasteful; the MPI implementations on both LeMieux and Blue Horizon short-circuit the network by using shared memory to pass data between two communicating processors within the same node.

Other work on SMP machines has often used heavy-weight processes instead of the lighter-weight threads we use in this work [17] [24]. While creating one process per physical processor does provide more traditional message-passing semantics as well as ease of porting, we feel that the benefits of utilizing a threaded model outweigh the benefits of using processes. A threaded model provides reduced switching/scheduling and copying overheads [18], allowing for faster context switches. Utilizing a threaded model also allows for the use of shared memory, a feature critical to efficient queue management, as we use shared memory to achieve fast intra-node communication. However, the explicit control and semantics of messages allow for more optimal and cost-effective inter-node communication. A hybrid approach, using OpenMP threads to manage on-node communication through shared memory data structures and MPI message passing routines to manage inter-node data transfer, brings the benefits of both paradigms, using the strengths of each individual communication model to offset the weaknesses of the other.

Essentially, going to the hybrid multi-tier model creates another layer in the communication control levels. Previously, with message passing or shared memory, there were just the Collective and Processor levels of communication control. With a hybrid multi-tier model, we now have a new middle level: Node-level control.
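The following sketch shows one way the hybrid arrangement described above might be structured: MPI is initialised with funneled thread support so that only the master OpenMP thread (the proxy) ever calls MPI, while the remaining threads compute. The routine names are hypothetical placeholders, not the thesis's code.

    #include <mpi.h>
    #include <omp.h>

    // Hypothetical placeholders for the node-level work.
    void proxy_exchange_ghost_cells() { /* post and complete all off-node MPI traffic */ }
    void compute_inner(int)           { /* ghost-independent computation */ }
    void compute_annulus(int)         { /* ghost-dependent computation */ }

    int main(int argc, char** argv) {
        int provided = 0;
        // FUNNELED: the process is multi-threaded, but only the thread that
        // called MPI_Init_thread (our proxy) will ever make MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED)
            MPI_Abort(MPI_COMM_WORLD, 1);

        const int iterations = 16;                  // illustrative only
        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();
            for (int it = 0; it < iterations; ++it) {
                if (tid == 0) {
                    proxy_exchange_ghost_cells();   // the proxy thread owns the network
                } else {
                    compute_inner(tid);             // workers overlap with the exchange
                }
                #pragma omp barrier                 // ghost cells are now in place

                if (tid != 0) {
                    compute_annulus(tid);           // ghost-dependent work
                }
                #pragma omp barrier                 // keep iterations in lock-step
            }
        }

        MPI_Finalize();
        return 0;
    }

This mirrors the structure of Figure III.1 later in this chapter, with OpenMP barriers standing in for the trigger between the proxy and the compute threads.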
In addition, by consolidating communication onto one proxy thread, we decrease the number of processes communicating on the node's network interface. It has been shown that when fewer processes share an interface, contention decreases, allowing lower latency and higher bandwidth to the interface [9]. By consolidating messages, we also eliminate excess costs incurred by message start-up times.

III.C.5 Scheduling the proxy thread

One issue with scheduling a proxy communicator thread is whether to dedicate a CPU in the SMP node to running the proxy thread, or whether the thread should share that CPU with a computational client. We adopt the term dedicated to indicate the proxy thread running on its own dedicated CPU, and shared to indicate that the proxy shares a CPU with another computational client. In the dedicated scenario, a CPU which could be used to run computational threads is taken off to handle the proxy thread. This obviously reduces the peak performance capability, in terms of raw computational power, of the application as a whole. However, this added cost must be weighed against the potential benefit of off-loading the communication overhead from the remaining computational threads onto the proxy thread. A dedicated scheduling scheme seems more applicable for applications that are very communication-intensive and exhibit enough communication load to make the loss of computational power worth the cost. In the shared scenario, the hope is that communication is not enough to load the entire CPU that the proxy thread is running on, and that by sharing that CPU between the proxy thread and another computational client, we can achieve the best of both worlds. Practice seems to indicate that this is not optimal, as it is only practical if the communication load is not intensive enough to warrant using a proxy thread in the first place. The potential thrashing and cache invalidation due to context switching of the threads reduces performance to the point that it is not worth running a proxy thread at all.

III.D Programming with the Proxy

Any programmer familiar with OpenMP will be familiar with how to spawn OpenMP threads. To use a proxy communicator, the developer must first isolate the communication routines and rewrite the algorithm to separate communication-dependent data from communication-independent data. The basic structure of an algorithm written to utilize a proxy thread looks like the pseudocode shown in Figure III.1. A library was developed which followed a similar model, by running a full-time thread that would accept MsgSend/MsgRecv primitives, in an attempt to utilize a proxy while maintaining MPI/message-passing semantics with minimal change, but performance was not optimal. A proxy requires that communication and computation patterns be well known and partitioned in advance, rather than communicating messages as they occur. To this end, a good interface for the future would be to maintain a data structure, separate from the main computational structure, that describes the work-set data: it would give indexes and ranges partitioning a dataset as seen on one of the nodes in Figure II.1, describing it so that it looks like the dataset seen in Figure II.2 from Chapter II. A suitable interface that we would build, given time, would look like the code in Figure III.2. Indeed, this interface is similar to the MotionPlan and FloorPlan structures which are a part of the KeLP library [11] [12].
    main() {
        for each iteration {
            #pragma omp parallel
            if (thread_id == 0) {
                asynchronous send ghost cell region data
                asynchronous recv ghost cell region data
                wait for asynchronous sends to complete
                trigger computational threads
            } else {
                do communication-independent computation
                wait for trigger from proxy (thread_id 0)
                complete computation
            }
        }
    }

Figure III.1: Pseudocode for utilizing a Proxy communicator thread

    main() {
        ProxyThread proxy
        DataStructure data
        DataCommunicationPlan plan
        describe/partition data into plan
        for each iteration {
            proxy.StartIteration()
            do communication-independent computation
            proxy.WaitForCompletion()
            complete computation
        }
    }

Figure III.2: Pseudocode for a future Proxy library interface

Chapter IV

Results

IV.A Testbed & Architectural Details

The proxy was implemented on two systems. The first, LeMieux, is a 750-node cluster located at the Pittsburgh Supercomputer Centre. Each node is an HP/Compaq AlphaServer ES45 system composed of a 4-way SMP (symmetric multi-processor) configuration of four 1 GHz Alpha 21264C EV68 processors. Each processor has a peak floating-point capability of 2 gigaflops. Each processor has separate 64 Kbyte write-back L1 data and instruction caches, with 2-way set associativity and a cache line size of 64 bytes. The L2 cache is an 8-megabyte dual-data-rate cache, enabling it to perform two data fetch operations per clock cycle; it is a unified data and instruction write-back cache, also with 2-way set associativity. The nodes are connected with a Quadrics QsNet interconnect, a high-performance network designed for low-latency, high-bandwidth communication, with an advertised peak bandwidth of 340 MB/sec and a process-to-process latency of 2 µs (5 µs for MPI messages). Each node has 4 Gigabytes of shared main memory, with a memory cross-bar switch capable of switching 8 gigabytes/second. (See http://www.psc.edu/machines/tcs/lemieux.html#optimization and http://h18002.www1.hp.com/alphaserver/es45/.)

The nodes run Compaq Tru64 Unix v5.1A, and provide parallel programming toolkits in the form of MPI, OpenMP, and Shmem. The Fortran compiler used was the f90 Fortran90 compiler. LeMieux provides both GNU and HP compilers; the code for this thesis was compiled using the HP cxx C++ compiler. The Fortran code was compiled with the options -fno-second-underscore -ff90 -fugly-complex, while the C++ code was compiled with the option -O3.

The second system used was Blue Horizon, a 144-node system located at the San Diego Supercomputer Centre. Each node is an 8-way IBM Power3 system consisting of eight 375 MHz Power3 processors. Each node has 4 Gigabytes of shared main memory, with each processor having 1.5 GB/sec bandwidth to memory. Each processor has an 8 MB 4-way set associative L2 cache, and a 64 KB 128-way set associative L1 cache. Both the L2 and L1 caches have a cache line size of 128 bytes. The nodes are interconnected with IBM's fast proprietary Colony switch, which has a measured MPI bandwidth of 350 MB/sec and a latency of 17 µs.

The nodes run AIX 4.3 with the IBM MPI and OpenMP implementations. Native IBM compilers are used, with xlC_r and xlf_r used for C++ and Fortran code respectively. The compilers are wrapped in the MPI mpCC and mpF compiler wrappers. The C++ code was compiled with the options -O3 -qstrict -qarch=auto -qtune=auto -qsmp=omp:noauto, while the Fortran code was compiled with the -O3 -qstrict -qarch=auto -qtune=auto -qsmp=omp:noauto -u -qnosave -q64 flags.
IV.B Performance Side Effects

We believe the hybrid model will provide the benefits of both message passing and shared memory with, we hope, none of the deficiencies. However, there are some side effects caused by using a proxy thread. Because one of the processors has been taken out of the compute pool and dedicated to running communication, the computation time on the remaining computational threads necessarily increases (i.e. on the other three CPUs on LeMieux, and the other seven CPUs on Blue Horizon). We can model an expected level of performance as follows.

Let
    Ts = total running time in the synchronous model
    Ct = time spent communicating (Ct < Ts)
    p  = number of CPUs per node
    s  = number of CPUs per node dedicated to running a proxy

Then the theoretical time in the proxy model, Tp, is defined as:

    Tp = max( (Ts - Ct) * p / (p - s), Ct )

That is, we scale the computational time by the ratio p / (p - s), given that we lose s processors to run proxy communicators. (It is the maximum of the two expressions since there are cases in communication-intensive algorithms where the communication time spent by the proxy thread dwarfs the computational time spent by the other threads, making the overall time equal to the proxy communication time.) Assuming one would utilize all p CPUs on a node, we can expect to lose s/p of the computational performance. For example, with Blue Horizon, each node has eight processors. Assuming each CPU is utilized to its full extent (100%), we expect to see a 1/8 = 12.5% rise in computational time for the seven other computational threads when a proxy thread runs dedicated on the eighth processor. Similarly, with LeMieux, we expect to see a 1/4 = 25% rise in computational time on the remaining three computational threads. This performance model was originally derived by Fink [11] and Baden [3].

In addition, since all the CPUs within a node share the same main memory, problems with shared memory contention arise. As the node grows wider (i.e. more processors sharing the same bus), this shared memory contention cost grows geometrically. In machines with two or four CPUs per node (such as LeMieux), this is generally not too much of a problem. As the configurations grow wider, though, it can become detrimental, especially on machines such as Blue Horizon, which is 8-way, or NERSC's Seaborg, which is 16-way. This is often alleviated by having multiple ports to memory, allowing CPUs to access memory simultaneously without being bottlenecked. On Blue Horizon, each group of four CPUs within a node shares the same memory port. For our case, since we run one proxy thread that deals mostly with communication and does not have the same memory bandwidth requirements as the computational threads, this presents a load imbalance problem: three computational threads would share one port, while the remaining port would be shared among the other four computational threads.

Communication-wise, although we noted in Section III.C.4 that aggregating communication onto one thread (the case where s = 1 in the performance model presented) can result in lower latency and higher bandwidth, it must also be noted that in problems where the chief constraint is bandwidth (as opposed to CPU or memory bounds), the aggregate bandwidth is often lower than when using multiple processes to communicate [9].

One other side effect must be noted even when using simple overlap with MPI asynchronous calls.
It is common for asynchronous communication to cause an increase in computational time, even without the proxy. This can be attributed to a few components. The asynchronous transfer of data may cause network interrupts, forcing the system to context-switch and possibly spend time managing the transfer, or lose cache locality in servicing the communication. Additionally, the network interface must move the data back into memory, which places an additional burden on the memory system and decreases the available memory bandwidth for the computational threads.

We also noted that IBM's MPI implementation on Blue Horizon appears unable to overlap communication using MPI's simple non-blocking asynchronous calls. Indeed, it actually causes the communication time to increase.

    Model   Total (s)                            Per Iteration (ms)
            Total   Comm.   Wait   Ideal Proxy   Total   Relax   Comm.
    Sync    2.95    1.07    0.88   2.36          178     133      70
    Async   3.60    1.72    1.63   2.25          225     133     108

Table IV.1: RedBlack3D on Blue Horizon, average times for 4 nodes (16 iterations, N=400), showing the IBM MPI implementation's inability to realise overlap when using immediate mode asynchronous communication

IV.B.1 Optimized Hardware

Clusters can have varying levels of interconnect, ranging from regular Ethernet up to highly specialized networks such as IBM's Colony or the Quadrics QsNet interconnect (used on Blue Horizon and LeMieux respectively). Both are designed for high-bandwidth, low-latency communication. In the case of the Quadrics interconnect, the network interface, Elan, is as intelligent as some nodes in previous supercomputers: its core consists of a flexible and powerful dedicated co-processor which can be programmed as the user wishes. As such, Compaq's MPI implementation on LeMieux is highly optimized to exploit the power of Elan. This extra co-processor is powerful enough to be thought of as a communications proxy in its own right, employing its own cache, CPU, and memory. This may have a side effect in that optimizing the algorithm for use of a software proxy may circumvent the very usefulness of the Elan interface, and may produce adverse effects as the Elan co-processor, having been programmed by the MPI library to respond favourably to a single-tier communication model, attempts to handle the unusual communication pattern of our multi-tier asynchronous proxy communication model.

IV.C RedBlack3D

IV.C.1 Blue Horizon

We first ran some small benchmarks to evaluate the ability to achieve overlap using base MPI immediate mode asynchronous calls. From the results shown in Table IV.1, we observed that the MPI implementation on Blue Horizon was unable to realize overlap using the asynchronous communication primitives. Because of this, comparisons are made chiefly between the synchronous single-tier and proxy multi-tier models.

Before running experiments, it was helpful to see how the RedBlack3D kernel would scale thread-wise. These results give a ballpark figure as to what sub-problem size per node would be most optimal, and show how well the RedBlack3D kernel parallelises across multiple threads accessing the memory sub-system, as well as what overheads (if any) the thread layer creates. To examine how the RedBlack3D computational kernel scaled, we ran the kernel itself on one node only, with no communication, varying from 1 to 8 threads; the results are shown in Table IV.2. These results show that at a smaller problem size per node (240^3), the problem scales poorly due to the size of the problem.
It is simply too small to effectively parallelize across 8 processors. At 360^3 we see the problem start to scale more effectively: at 4 threads, the running time has decreased from 30.658 s to 7.888 s, a speedup of 3.886, very close to the linear speedup value of 4 we would hope to achieve. At 7 threads, we attain a speedup of 6.316. This is the value we are most interested in, since we will be taking one thread away to run the communication proxy. At 480^3, the problem scales much less well: at 7 threads, we attain a speedup of only 4.935, hardly the linear speedup value of 7 we would hope to achieve. The graph in Figure IV.1 shows the speedup values plotted from Table IV.2. Since this is run on one node only, eliminating off-node communication, we can observe that the lack of speedup at 7 threads on the larger problem is contributed mostly by the overhead of the thread system.

    Problem Size    Number of OpenMP Threads
                    1       2       3       4       5       6       7       8
    240             4.94    2.67    2.10    1.94    1.56    1.32    1.21    1.04
    360             30.66   15.25   10.27   7.89    6.35    5.33    4.85    4.14
    480             44.20   21.51   16.63   15.35   12.25   10.29   8.96    8.01

Table IV.2: Running time in seconds, showing the effect of varying the number of threads on the RedBlack3D kernel, to observe scaling of the memory system and thread system overhead on Blue Horizon

Figure IV.1: Speedup of the RedBlack3D kernel on Blue Horizon

The plot in Figure IV.1 shows that to obtain close to linear speedup, the problem size per node must be large enough that there is enough work to parallelize, but not so large as to saturate the bandwidth to the shared main memory system. At our problem size of N=800^3 running over eight nodes, this equates to a per-node sub-problem size of 400^3, which obtains reasonably close to linear speedup.

We also experimented with comparing the performance of running eight MPI heavy-weight processes per node (1 per CPU), communicating using MPI's message-passing-via-shared-memory implementation, versus running one MPI process per node spawning eight light-weight OpenMP threads. Essentially, we are benchmarking message passing at the single-tier level versus the threaded model, with the intent of measuring thread system overheads. This also gives an idea of how the processor sub-problem strides and geometry affect performance. This benchmark, by eliminating off-node communication, shows more clearly, and justifies, the use of OpenMP to manage intra-node communication.

    Problem Size    Threaded        MPI HWP
                    Total           Total   Comm.
    120             0.10            0.08    0.03
    240             1.06            1.11    0.18
    320             2.51            2.62    0.38
    360             4.81            3.93    0.87
    400             4.87            5.11    0.72
    480             8.36            9.10    0.95
    540             11.94           n/a     n/a
    600             16.33           n/a     n/a

Table IV.3: RedBlack3D on Blue Horizon, MPI HWP vs. OpenMP LWP (communication on the proxy is always zero for 1 node)

From the data shown in Table IV.3, we can see that the proxy outperforms the synchronous model slightly at N = 240^3, and more so at N = 480^3. At N = 480^3, the communication cost for the synchronous model has started to take up a considerable fraction of the running time, almost 1/9 of the total time, a cost the proxy model does not have to bear. However, at 360^3, the single-tier solution gives an 18% performance increase over the multi-tier, despite being slower at both the smaller and larger problem sizes of 240^3 and 480^3. We believe this is due to what is essentially bad luck in hitting a poor stride for the sub-problem. With the multi-tier model, the threads operate on a single large matrix, which creates larger memory strides in accessing subsequent rows.
The larger strides can cause TLB conflicts between the processors as they access the shared memory. In the single-tier model, since each process has its own isolated sub-problem, the per-process working set is one eighth the size of the multi-tier model's single large matrix. This gives each processor on the node smaller, more favourable memory strides. We noticed an interesting effect at N=600^3: the run crashed when trying to allocate memory for eight sub-problems of 600^3/8 points each, but succeeded when creating a single 600^3 matrix shared among eight light-weight OpenMP threads. The node has 4 GB of memory, and a 600^3 problem (assuming 8-byte doubles) comes very close to filling it, since the node must allocate both the main matrix and the right-hand-side matrix. The lower overhead of creating light-weight threads appears to leave more working space than the heavier MPI processes, though this is a small example.

Table IV.4 summarises results typical of a run of RedBlack3D for 16 iterations on 8 nodes (64 processors) with a problem size of 800^3 in double-precision floating point. The table shows the total time of the run, the total time spent communicating, and the total time spent waiting for communication to finish. For the synchronous and asynchronous versions, the communication time covers both the overhead of the communication calls and the time spent waiting for communication to finish (latency). For the proxy, since the proxy thread initiates communication on behalf of the node, the communication-primitive overhead experienced by the other threads is zero, and the communication time consists solely of the time the computational threads spend waiting for communication to finish. There is still some overhead in the form of thread synchronization. For the synchronous and asynchronous runs, the performance model described in the previous section was applied to determine an ideal proxy performance time. Table IV.4 also shows the per-iteration breakdown for each model, separating the time spent in computation from the time spent in communication.

Table IV.4: RedBlack3D on Blue Horizon, average times for 8 nodes (16 iterations, N=800^3).

                           Total (s)                      Per iteration (ms)
    Model     Total   Comm.   Wait   Ideal proxy      Total   Relax   Comm.
    Sync      11.81    3.58   2.61     10.51            738     557     224
    Proxy     10.20    0.53   0.53       -              637     606      33

The jobs were run with sufficient iterations to amortize any warm-up effects and to ensure that runs lasted a consistent amount of time, from a few seconds up to about ten seconds. Times are reported as the average of 16 runs, with outliers discarded; we defined an outlier as a run whose total time was at least 20% slower than the average of the other runs. Wall-clock times were measured with MPI_Wtime().

For the 800^3 case, we observed that the relaxation time (the computational phase of the RedBlack3D algorithm) increased by about 8.8% in going from the synchronous algorithm to the proxy. Since one CPU is taken away to run the proxy, we predict that the system loses about 12.5% of its computational power, so the observed 8.8% is close to this prediction. The communication time decreased considerably, by about 85%, bringing the total per-iteration time down by about 13.7%. It is interesting that the relaxation time did not rise as much as predicted; this suggests that the problem is constrained by memory bandwidth rather than by the CPUs.
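As a back-of-the-envelope check of these figures (this is not the performance model itself, only a first-order estimate that assumes relaxation time scales inversely with the number of compute threads): losing one of the node's eight CPUs removes 1/8 = 12.5% of its compute capacity, so a purely CPU-bound kernel would see its per-iteration relaxation time stretch by

    \[ \frac{8}{7} \approx 1.143, \qquad 557\,\mathrm{ms} \times \frac{8}{7} \approx 637\,\mathrm{ms}, \]

whereas the observed proxy relaxation time is 606 ms, an increase of only

    \[ \frac{606 - 557}{557} \approx 8.8\%, \]

well below the CPU-bound prediction and consistent with a kernel limited by the node's shared memory bandwidth rather than by the seven remaining compute threads.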
Given the performance of the synchronous case, the ideal proxy performance (from the performance model discussed above) should have been 10.5s. Our observed running time was 10.2s, showing that the shared-memory model, in addition to eliminating part of the communication cost, sped up other parts of the algorithm, most likely through the savings of shared-memory access. We eliminated 1.61s from the synchronous time of 11.8s, the gain from removing communication being partially offset by the increased relaxation time, for an overall speedup of about 13.6%.

We next attempted to scale the workload up to 16 nodes while maintaining roughly the same sub-problem size of 400^3 doubles per node, giving an aggregate problem size of N=1008^3. At this problem size the proxy is unable to match the performance of the baseline synchronous model. Our results, shown in Table IV.5, indicate that while the relaxation time per iteration for the synchronous model (run at a 4x4x8 processor geometry) stays roughly constant at around 560ms, the relaxation time for the proxy model jumps from about 600ms to 740ms.

Table IV.5: RedBlack3D on Blue Horizon, average times for 16 nodes (16 iterations, N=1008^3).

                               Total (s)                      Per iteration (ms)
    Model         Total   Comm.   Wait   Ideal proxy      Total   Relax   Comm.
    Sync          12.41    4.40   3.43     10.26            775     558     274
    Proxy         12.89    1.21   1.21       -              806     740      76
    Multi-tier    14.30    3.20   3.20       -              894     704     201

It was surprising to see the relaxation time jump this much; on closer inspection, part of the problem can be put down to cache locality. The 800^3 problem allowed us to run at a node geometry of 1x2x4 (in X x Y x Z order), which breaks the problem into 3D sub-matrices of 800x400x200. For the 1008^3 problem we tried several geometries, with the best results (Table IV.5) coming from the 1x4x4 geometry, which gives a sub-matrix size of 1008x252x252. The matrix is laid out so that x is the fastest-varying dimension, the y-dimension is offset by a stride of the x extent, and the z-dimension by a stride of the x extent times the y extent. The best cache locality therefore comes from having a long x-partition, hence the need to keep a single partition along the x-axis. This leaves a Y x Z geometry of 1x16, 2x8, 4x4, 8x2, or 16x1. Since the node divides its sub-problem internally along the y-axis, it is better to avoid over-partitioning the y-dimension, which leaves 1x16, 2x8, or 4x4 as candidate Y x Z geometries. The 1x16 and 2x8 partitions leave a z-dimension of only 63 or 126, which is not enough to benefit from cache pre-fetching. That leaves 1x4x4, which, while it gives the best performance of the proxy geometries, cannot match the synchronous performance because of the smaller y-dimension partitions. At 1x4x4 each node gets a 252-row block in the y-dimension, so each processor works on roughly a 252/7 = 36 row sub-block, which does not enjoy the same cache benefits as the smaller problem, whose y-dimension block of 400/7 ≈ 57 rows allows more local computation to be performed.
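Stated explicitly (with N_x and N_y denoting the local x and y extents), the layout just described is the usual linearisation

    \[ \mathrm{offset}(i,j,k) \;=\; i \;+\; N_x\, j \;+\; N_x N_y\, k , \]

so consecutive x points are contiguous in memory, stepping in y skips N_x elements, and stepping in z skips N_x N_y elements. Keeping the full x extent on every node preserves long unit-stride runs, while the y partition fixes how many contiguous rows each of the seven compute threads sweeps:

    \[ \frac{1008/4}{7} = \frac{252}{7} = 36 \ \text{rows per thread at } 1008^3, \qquad \frac{800/2}{7} = \frac{400}{7} \approx 57 \ \text{rows per thread at } 800^3 . \]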
Our results are similar to those obtained by Baden and Shalit on the same machine [5]. With 8 nodes at a problem size of 800^3, they achieved total iteration times of 626ms using the multi-tier variant with overlap, proxy, and irregular partitioning, while ours achieved 637ms. Their single-tier synchronous version also achieved comparable performance, 732ms versus our 738ms.

We attempted a run at 24 nodes, but saw continued performance degradation along the same lines as the 16-node case. Our 24-node runs completed in 13.69s with the synchronous single-tier implementation, whereas the proxy finished in a slightly slower 15.13s. This was still faster than the asynchronous MPI implementation. We also implemented a multi-tier model without overlap and without a proxy thread (similar to Baden and Shalit's MT(k)+!OVLP model) to make sure the proxy itself was not causing unnecessarily high overheads. The relaxation times compared similarly (last row of Table IV.5), with the multi-tier model without overlap or proxy achieving relaxation times of about 700ms; the proxy's relaxation time is slightly higher owing to the loss of one computational thread. This result indicates that the sharp increase in relaxation time is not due to the proxy thread or to overlap. Instead, it appears to be an overhead penalty incurred by the shared-memory OpenMP thread system, part of which may come from the memory stride of the sub-problem in the multi-tier layout. The single-tier model splits the sub-problems into smaller pieces, allowing smaller strides and thus better performance and locality (fewer misses) in both the cache and the TLB.

It seems the proxy is hard pressed to gain performance beyond the eight nodes we ran on. At larger problem sizes we run into cache problems from non-optimal geometry partitioning, as well as the growing impact of losing computational threads; at smaller sizes there is not enough computation to amortize the communication costs. A balance must be struck between problems too small to provide enough computation and problems so large that partitioning and cache effects dominate.

IV.C.2 LeMieux

The MPI implementation on LeMieux is reportedly capable of realizing overlap in the baseline asynchronous MPI model, so we were able to benchmark some more interesting comparisons. With four processors per node, versus eight in each Blue Horizon node, we expected the proxy to have a larger impact on lost computation, since dedicating a CPU now costs 25% of the aggregate computational power versus 12.5% on a Blue Horizon node.

We performed kernel benchmarks on LeMieux similar to those on Blue Horizon in order to see how well the memory system would scale. We expected results similar to Blue Horizon's, as the node configurations are comparable: while LeMieux has four processors per node versus Blue Horizon's eight, the path to memory is similarly narrow (LeMieux's processors share one port to memory, much as on Blue Horizon, where each set of four processors has its own port to main memory).

Table IV.6: Running time (in seconds) and speedup of the RedBlack3D kernel as the number of OpenMP threads varies, showing the scaling of the memory system and the thread-system overhead on LeMieux.

                              Number of OpenMP threads
    Problem size        1              2               3               4
    240^3            3.36   -      1.98  1.70x     1.57  2.14x     1.49  2.26x
    360^3           12.92   -      7.42  1.74x     5.69  2.27x     5.00  2.58x
    480^3           30.08   -     17.16  1.75x    13.40  2.25x    12.01  2.51x

Figure IV.2: Speedup on the RedBlack3D kernel for LeMieux.

Table IV.7: RedBlack3D on LeMieux, MPI heavy-weight processes (HWP) vs. OpenMP light-weight threads (LWP) on a single node, 8 iterations (communication time for the threaded model is always zero on one node). 304^3 corresponds to 768^3 over 16 nodes, and 456^3 corresponds to 1152^3.

                          Multi-tier                       Single-tier
    Problem size   Total (s)  Relax/iter (s)   Total (s)  Comm. (s)  Relax/iter (s)
    304^3             2.89        0.36            3.29       0.48         0.35
    360^3             4.94        0.62            5.54       0.89         0.59
    424^3             8.09        1.05            8.95       1.06         1.01
    456^3            10.34        1.31           11.41       1.61         1.21
    480^3            12.02        1.49           13.31       1.84         1.38
Since RedBlack3D is constrained more by memory than by CPU, we expected similar results given similar memory-subsystem performance. The data in Table IV.6 show that the nodes on LeMieux peak at around a 2.6 times speedup: going from 1 to 4 threads on one node offers only a 2.2 to 2.6 times speedup for any of the three sample problem sizes. At the proxy configuration (3 threads), the best speedup attained was at 360^3, a 2.27 times speedup from 12.924s to 5.686s. Since off-node communication is eliminated in this benchmark, these results again suggest that the problem is memory-bound rather than CPU-bound, constraining the speedup obtainable from multiple threads processing the data.

We again ran the MPI single-tier versus OpenMP multi-tier comparison on a single node to better understand how local problem stride and geometry affect the performance of the RedBlack3D kernel; these results appear in Table IV.7. We took our later problem sizes of 768^3 and 1152^3, run over 16 nodes, and scaled them down to roughly the corresponding single-node sub-problem sizes of 304^3 and 456^3. We also ran a couple of intermediate sizes (360^3 and 424^3) and a slightly larger size of 480^3. The table shows the total runtime and the per-iteration relaxation time for the multi-tier (threaded) model, and the total runtime, total communication time, and per-iteration relaxation time for the single-tier (synchronous MPI) model. From these results one can see that the computational time of the multi-tier threaded model is slightly higher than that of the single-tier model. In the multi-tier model the data set is allocated as one large shared three-dimensional matrix, leading to longer strides than the smaller per-process sub-problems of the single-tier model; the shorter strides benefit the single-tier solution by reducing the relaxation times. Overall, however, the total run-times tend to be smaller for the multi-tier threaded model because intra-node communication is eliminated.

Our results from running RedBlack3D in synchronous, asynchronous, and proxy modes can be seen in the following tables. Table IV.8 shows the run-times when a problem size of 800^3 is run on 8 nodes (4 processors each) for 16 iterations, the same configuration we ran on Blue Horizon. Table IV.9 shows the performance when run for 256 iterations on sixteen nodes at both 768^3 and 1152^3. At 8 nodes and 800^3 we see a surprising result: the asynchronous version runs slower in both computation and communication than the synchronous version. It appears that the cost of running computation and communication simultaneously affects the node's performance very negatively. This again indicates that overlap using the regular asynchronous MPI communication primitives (Isend/Irecv) is not achievable, an observation we also made on Blue Horizon.
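For reference, the baseline asynchronous variant relies on MPI's immediate-mode primitives in the usual pattern sketched below. Buffer names, the neighbour list, and the per-direction packing are illustrative assumptions rather than the code used in this work; the point is that the calls return immediately, but any actual overlap depends entirely on the MPI implementation progressing the transfers in the background.

    /* Sketch of the immediate-mode (Isend/Irecv) ghost-cell exchange that
     * the "Async" rows in the tables benchmark.  Neighbour handling and
     * buffer names are illustrative assumptions. */
    #include <mpi.h>

    /* nbr[d]: rank of the neighbour in direction d (or MPI_PROC_NULL).
     * sendbuf[d]/recvbuf[d]: packed ghost faces of count[d] doubles.
     * ndirs is at most 6 for a 3D decomposition. */
    void exchange_async(int ndirs, const int *nbr,
                        double **sendbuf, double **recvbuf, const int *count,
                        MPI_Comm comm)
    {
        MPI_Request req[2 * 6];
        int nreq = 0;

        /* Post all receives and sends up front; these calls return
         * immediately. */
        for (int d = 0; d < ndirs; d++) {
            MPI_Irecv(recvbuf[d], count[d], MPI_DOUBLE, nbr[d], 0, comm,
                      &req[nreq++]);
            MPI_Isend(sendbuf[d], count[d], MPI_DOUBLE, nbr[d], 0, comm,
                      &req[nreq++]);
        }

        /* ... the interior relaxation would run here, and would overlap
         * with the transfers only if the MPI implementation makes
         * progress on them in the background ... */

        /* Block until every transfer has completed. */
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    }

The measurements above show that little such background progress occurs on either machine, so the communication cost simply reappears at MPI_Waitall.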
With the proxy model, however, we were able to eliminate the communication wait time almost entirely, decreasing it from 17% of the run to a negligible 0.2%, at the cost of increasing the computation time per iteration by about 10%. The increased computation time is the direct counterpart of the near-zero wait time: the loss of 25% of the aggregate computational power slows the computation enough that it completely overlaps the communication. Overall, a positive result is still attained, an overall speedup of 5.13%. For the 16-node configuration at the smaller problem size of 768^3, we see that the proxy, while out-performing the regular synchronous MPI code, does not out-

Table IV.8: RedBlack3D on LeMieux, average times for 8 nodes (16 iterations, N=800^3).

                              Total (s)                           Per iteration (ms)
    Model     Total   Comm.   Comm. %   Wait   Ideal      Total   Relax   Comm.
    Sync      16.56    2.81    16.9%    2.51   18.73       1034     885     178
    Async     17.18    3.48    20.2%    3.38   18.4
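For completeness, the sketch below shows one way a dedicated communication thread can be arranged on an SMP node with OpenMP and MPI, in the spirit of the proxy model evaluated above. It is a simplified illustration under assumed names (relax_interior, relax_boundary, and exchange_ghost_cells_blocking are placeholders), not the runtime used in this work, and it assumes MPI was initialised with MPI_Init_thread() requesting at least MPI_THREAD_FUNNELED so that the master thread may make MPI calls while the others compute.

    /* Schematic of a proxy-style timestep on one SMP node: the master
     * OpenMP thread performs the node's MPI communication while the
     * remaining threads relax interior points that do not depend on the
     * incoming ghost cells.  The three functions below are assumed
     * placeholders. */
    #include <mpi.h>
    #include <omp.h>

    void relax_interior(int worker, int nworkers);           /* placeholder */
    void relax_boundary(int worker, int nworkers);           /* placeholder */
    void exchange_ghost_cells_blocking(MPI_Comm comm);       /* placeholder */

    void timestep(MPI_Comm comm, int nthreads)
    {
        #pragma omp parallel num_threads(nthreads)
        {
            int tid      = omp_get_thread_num();
            int nworkers = nthreads - 1;   /* one CPU is given to the proxy */

            if (tid == 0) {
                /* Proxy thread: carry out the whole node's ghost exchange.
                 * Blocking MPI calls are acceptable here because the other
                 * threads keep computing in the meantime. */
                exchange_ghost_cells_blocking(comm);
            } else {
                /* Compute threads: sweep the interior points, which do not
                 * need the ghost cells still in flight. */
                relax_interior(tid - 1, nworkers);
            }

            /* Wait until both the exchange and the interior sweeps finish. */
            #pragma omp barrier

            /* Compute threads then finish the boundary points that needed
             * the freshly received ghost cells. */
            if (tid != 0)
                relax_boundary(tid - 1, nworkers);
        }
    }

On an 8-way Blue Horizon node this leaves seven compute threads and one proxy thread, matching the 12.5% loss of compute capacity (25% on a 4-way LeMieux node) discussed above.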
